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1.  Summary 

It  is  frequently  observed  that  the  dimension  of  inputs  is  much  larger  than  the  sample 
size.  Examples  are  image  construction,  microarray  data,  data  mining  etc.  In  such  cases, 
standard  learning  methods  either  are  not  applicable  or  perform  badly.  Also,  identifying 
a  small  subset  of  important  features,  which  discriminate  outputs,  becomes  an  important 
subject.  Hence,  good  learning  algorithms  with  high  dimensional  inputs  should  provides  a 
classification  rule  which  not  only  yields  high  accuracy  but  also  has  an  ability  of  identifying 
few  important  features. 

Apparently,  there  are  two  popularly  used  such  algorithms.  One  is  decision  trees  and  the 
other  is  LASSO.  The  objective  of  this  research  is  to  develop  new  algorithms  for  decision 
tree  and  LASSO  for  improving  computational  power,  prediction  accuracy,  and  ability  of 
detecting  significant  features. 


2.  Results 

In  this  section,  we  summarize  the  research  results.  Details  are  appended. 

For  decision  trees,  we  have  developed  a  new  pruning  algorithm  and  showed  that  com¬ 
bining  with  the  Bumping  procedure,  the  proposed  pruning  algorithm  improves  prediction 
accuracy  significantly.  Moreover,  its  computing  time  is  much  less  than  that  for  the  stan¬ 
dard  method  of  constructing  decision  trees  since  the  proposed  method  does  not  need  the 
cross-validation  step  which  is  usually  computationally  demanding. 

For  LASSO,  we  have  developed  a  new  computational  algorithm  based  on  the  gradient 
descent  method.  The  proposed  algorithm  does  not  require  any  sophisticated  optimization 
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algorithms  and  hence  it  is  much  faster  and  stabler.  Also,  we  derived  the  convergence  rate 
of  the  proposed  algorithm,  and  surprisingly  it  turns  out  that  the  convergence  rate  does 
not  depend  on  the  dimension  of  inputs. 

3.  Contributions  to  machine  learning  and  future  works 

LASSO  has  gained  popularity  recently  due  to  its  superiority  in  accuracy  as  well  as  the 
sparsity.  However,  its  computation  requires  special  optimization  techniques  and  so  it  is 
hardly  applied  to  large  dimensional  data,  in  particular  when  the  loss  function  is  not  the 
square  loss.  The  proposed  algorithm  in  this  research  resolves  this  computational  issues 
completely,  and  hence  LASSO  techniques  can  be  used  freely  to  large  dimensional  data. 

For  future  works,  we  can  develop  faster  and  simpler  algorithm  for  LASSO  and  other 
types  of  sparse  models  such  as  fused  LASSO  and  blockwise  sparse  model.  The  convergence 
rate  of  the  proposed  algorithm  is  m~l  where  m  is  the  number  of  iteration.  We  can  improve 
it  up  to  the  exponential  convergence. 

The  new  pruning  method  for  decision  trees  opens  a  new  direction  of  finding  good 
decision  trees.  In  particular,  when  the  proposed  method  can  be  used  with  ensemble  algo¬ 
rithms  such  as  boosting  and  random  forest,  we  believe  that  more  efficient  and  informative 
decision  trees  can  be  found.  We  leave  this  approach  as  a  future  work. 
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Abstract 

The  cost-complexity  pruning  generates  nested  subtrees  and  selects  the  best  one. 
However,  its  computational  cost  is  large  since  it  uses  hold-out  sample  or  cross- 
validation.  On  the  other  hand,  the  pruning  algorithms  based  on  posterior  calcula¬ 
tions  such  as  BIC  (MDL)  and  MEP  are  faster,  but  they  sometimes  produce  too  big 
or  small  trees  to  yield  poor  generalization  errors.  In  this  paper,  we  propose  an  alter¬ 
native  pruning  procedure  which  combines  the  ideas  of  the  cost-complexity  pruning 
and  posterior  calculation.  The  proposed  algorithm  uses  only  training  samples,  so 
that  its  computational  cost  is  almost  same  as  the  other  posterior-based  algorithms, 
and  at  the  same  time  yield  stable  results  as  the  cost-complexity  pruning.  More¬ 
over  it  can  be  applied  for  comparing  non-nested  trees,  which  is  necessary  for  the 
BUMPing  procedure.  Empirical  results  show  that  the  proposed  algorithm  performs 
similarly  as  the  cost-complexity  pruning  in  the  standard  situation  and  works  better 
for  BUMPing. 
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1  Introduction 


Decision  trees  are  very  popular  models  in  applied  statistics  and  machine  learn¬ 
ing  for  classification  and  regression  (Breiman  et  al.,  1984;  Quinlan,  1993)  since 
they  are  simple  and  easy  to  be  interpreted.  The  typical  procedure  of  the  deci¬ 
sion  tree  induction  consists  of  two  steps,  so  called  growing  and  pruning.  The 
growing  step  constructs  a  fully  grown  tree,  which  is  usually  done  by  finding 
the  best  split  recursively  until  no  more  splits  are  possible.  Then,  the  pruning 
step  selects  the  best  subtrees  of  the  fully  grown  tree. 

Various  pruning  algorithms  have  been  developed  such  as  the  cost-complexity 
pruning(CCP;  Breiman  et.  al,  1984),  reduced-error  pruning(REP;  Quinlan, 
1987),  Bayesian  Information  Criterion  (BIC)  or  equivalently  minimum  de¬ 
scription  length  (MDL)  pruning  (Rissanen,  1978;  Quinlan  and  Rivest,  1989; 
Mehta  et  al.,  1995),  minimum  error  pruning  (MEP,  Niblett  and  Bratko,  1986) 
and  so  on.  These  pruning  algorithms  can  be  classified  into  two  types.  The 
first  type  uses  hold-out  samples,  to  which  CCP  and  REP  belong  to,  and  the 
second  type  including  BIC/MDL  and  MEP  uses  only  training  samples.  For 
the  pruning  methods  of  the  first  type,  a  sufficiently  large  amount  of  hold-out 
samples  should  be  required  for  selecting  the  best  tree.  If  hold-out  samples  are 
not  available,  one  can  resort  to  heuristic  methods  such  as  the  V-fold  cross 
validation (CV),  but  it  demands  more  computations  additionally  and  the  final 
result  is  unstable.  Moreover,  the  CV  is  not  easily  applicable  for  comparing 
non-nested  trees.  One  such  example  is  BUMPing  (Tibshirani  and  Knight, 
1999),  which  first  constructs  many  decision  trees  using  bootstrap  samples  and 
then  searches  the  best  model  within  them.  Therefore,  for  BUMPing,  a  method 
for  comparing  non-nested  decision  trees  is  necessary.  In  contrast,  the  pruning 
methods  of  the  second  type  do  not  pose  these  problems.  However,  the  finally 
selected  tree  tends  to  be  unnecessarily  large  or  small,  which  results  in  lower 
prediction  accuracy  and  arise  difficulty  in  interpreting  the  model.  See  section 
5  for  empirical  evidences. 

The  aim  of  this  article  is  to  develop  a  new  pruning  algorithm  to  resolve  the 
aforementioned  problems.  The  main  idea  of  the  proposed  pruning  algorithm 
is  to  combine  the  ideas  of  the  CCP  and  Bayesian  methods.  The  proposed 
algorithm  first  generates  the  sequence  of  nested  trees  in  the  same  manner 
as  CCP,  and  compares  them  via  calculating  their  posterior  probabilities  to 
choose  the  best  one.  By  doing  so,  we  can  reduce  computing  time  and  avoid 
the  instability  due  to  the  CV.  Also,  we  can  easily  compare  non-nested  trees 
by  their  posterior  probabilities,  and  hence  can  improve  the  performance  of 
BUMPing. 

In  order  to  evaluate  the  posterior  probabilities  of  given  trees,  we  could  use 
BIC  since  it  is  known  as  an  approximation  of  the  posterior  probability.  How- 
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ever,  for  tree  models,  we  observed  that  BIC  sometimes  but  not  often  yields 
unnecessary  large  or  small  trees  (see  Table  3  in  section  5).  Mehta  et  al.  (1995) 
reported  the  similar  results.  This  is  partly  because  BIC  internally  put  equal 
prior  probabilities  on  trees  of  different  sizes  and  the  degree  of  approximation 
is  poor  for  large  sized  trees  since  the  number  of  observations  in  each  terminal 
node  is  not  sufficiently  large.  The  proposed  algorithm  overcomes  these  defi¬ 
ciencies  in  BIC  by  putting  a  size-dependent  prior  and  calculating  the  exact 
posterior  probabilities. 

The  paper  is  organized  as  follows.  Section  2  briefly  describes  the  three  well 
known  pruning  algorithms  -  CCP,  BIC  and  MEP.  Section  3  explains  the  pro¬ 
posed  pruning  algorithm  including  the  way  of  specifying  prior  probabilities  and 
calculating  posterior  probabilities  for  given  decision  trees.  Section  4  presents 
the  BUMPing  technique  based  on  the  proposed  pruning  algorithm.  In  sec¬ 
tion  5,  we  give  the  results  of  empirical  studies  using  real  data  sets.  Finally, 
discussions  follow  in  section  6. 


2  Review  of  Pruning  Methods 


In  this  section,  we  briefly  review  the  three  pruning  methods  based  on  CCP, 
BIC(MDL)  and  MEP,  respectively.  The  objective  of  this  paper  is  to  combine 
the  ideas  of  these  pruning  methods  to  develop  a  new  pruning  method. 


2.1  Cost-complexity  pruning  (CCP) 


Breiman  et  al.  (1984)  developed  the  minimal  cost  complexity  pruning(CCP) 
which  performs  as  follows  : 

(1)  Construction  of  the  set  of  candidate  subtrees  T0,  7j , . . . ,  TK,  according  to 
the  cost-complexity  measures. 

(2)  Selection  of  the  best  tree  T*  among  T0,  7j , . . . ,  Tj<  according  to  estimates 
of  the  generalization  errors  through  E-fold  CV  or  hold-out  pruning  sam¬ 
ples. 

Let  Tmax  denote  the  fully  grown  tree  for  a  given  training  data  set.  For  any 
subtree  T  of  Tmax  and  A  >  0,  Breiman  et  al.  (1984)  defined  the  cost  complexity 
measure  as  follows  : 

RX(T)  =  R(T)  +  \\T\, 

where  R{T )  is  a  misclassihcation  error  of  T  calculated  on  the  training  data, 
and  |Tj  is  the  number  of  terminal  nodes  in  T.  For  a  given  A,  let  T\  be  the 
optimal  subtree  of  Tmax  which  minimizes  R\{T).  Breiman  et  al.  (1984)  showed 
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that  for  the  increasing  values  of  0  =  A0  <  Xi  <  ■  . the  corresponding  sequence 
of  optimal  subtrees  T\0 ,  TXl , . . .  forms  a  nested  structure.  They  proposed  to 
estimate  the  optimal  A  by  minimizing  the  CV  error  or  the  hold-out  sample 
error,  and  then  the  corresponding  tree  T \  is  selected  as  the  final  best  tree. 


2.2  Bayesian  Information  Criterion  (BIC) 


Let  Mi,  M2, ...  be  given  models  and  let  7t(Mi),  7t(M2)  , ...  be  the  prior  prob¬ 
abilities.  For  a  given  model  M,  the  likelihood  and  prior  of  parameters  are 
denoted  by  P(Y\Q,M)  and  7t(@|M),  respectively.  Then  the  posterior  proba¬ 
bility  of  the  model  M  can  be  computed  by  integrating  out  the  parameter  0. 
That  is, 

P(M\Y)  oc  7 r(M)  J  P(Y\Q,M)7r(Q\M)dQ.  (1) 


When  the  integral  part  of  the  above  equation  is  very  complex,  it  is  not  easy 
to  obtain  the  closed  form  directly.  However,  we  can  approximate  P(M\Y )  by 

log P(M\Y)  «  logP(T'|@,M)  -  ^  log  W 

for  sufficiently  large  samples  via  the  Laplace’s  method,  where  d  is  the  num¬ 
ber  of  free  parameters  and  0  is  the  MLE  under  the  model  M.  Using  this 
approximation,  Schwarz  (1978)  defined  BIC  as 

BIC  =  — 2  log  P(F|0,  M)  +  d  log  IV, 

which  is  an  approximation  of  the  negative  of  the  log-posterior,  and  proposed 
to  select  the  model  which  minimizes  BIC. 

Rissanen  (1978)  proposed  the  MDL  principle,  which  seeks  to  minimize  the 
number  of  bits  needed  to  describe  the  data  over  the  class  of  models.  The  total 
description  length  is  defined  by 

l(Y,  0,  M)  =  l(M)  +  l(Q\M)  +  l(Y |0,  M). 

The  first  term  of  the  above  equation  indicates  the  encoding  length  of  the 
model  M,  the  second  encodes  the  parameters,  and  the  third  encodes  the  data 
Y  given  the  model  and  parameters.  The  MDL  of  model  selection  chooses  the 
model  M  and  the  parameters  0  which  minimize  l(Y,  0,  M)  for  the  given  data 
Y.  Applying  Shannon’s  theory,  we  can  show  the  code  length  l(Y |0,  M)  for  the 
data  Y,  equals  the  negative  log  likelihood  —  log  P(Y\Q,  M).  In  addition,  if  we 
accept  the  fact  p(Q\M)  oc  e~l^M\  the  MDL  principle  can  be  interpreted  as 
the  maximization  of  the  logarithm  of  the  posterior  probability  of  the  model, 
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which  is  essentially  the  same  as  minimizing  BIC.  Since  Rissanen  (1978)  first 
proposed  MDL,  many  modifications  have  been  suggested  and  applied  succes¬ 
sively  by  Wallace  and  Freeman  (1987),  Quinlan  and  Rivest  (1989),  Mehta 
et  al.  (1995)  and  Takeuchi  (1997). 


2.3  Minimum.  Error  Pruning  (MEP) 


Niblett  and  Bratko  (1986)  proposed  a  bottom-up  approach  seeking  for  a  single 
tree  that  minimizes  “the  expected  error  rate  on  an  independent  data  set.”  For 
a  A'-class  problem,  the  expected  probability  that  an  observation  at  node  t 
belongs  to  the  i-th  class  is  defined  as 

,  ,  nAt)  nit )  m 

Pi(t  =  -777  , 

n{t)  n[t)  +  m  n[t)  +  m 

where  n(t)  is  the  number  of  observations  in  node  t,  rij(f)  is  the  number  of 
observations  in  node  t  whose  class  level  is  i,  7p  is  the  prior  probability  of  the 
i-th  class,  and  m  is  a  tuning  parameter  that  determines  the  impact  of  the  prior 
probability.  Cestnik  and  Bratko  (1991)  named  Pi(t)  as  m-probability  estimate. 
Actually,  Pi{t)' s  are  Bayes  estimators  when  we  assume  that  the  distribution 
of  class  levels  are  multinomial,  and  the  prior  probabilities  are  distributed  ac¬ 
cording  to  the  Dirichlet  distriution. 

By  letting  m  =  K  and  7r *  =  1/A',  i  =  1, . . . ,  K,  i.e,  the  prior  probabilities  are 
equal  for  the  all  classes,  Niblett  and  Bratko  (1986)  defined  the  expected  error 
rate  as 


EER(t)  =  rninjl  —  Pi(t)} 

i 


min 

i 


n{t) 


-  ni(t)  +  K  —  1 
n(t)  +  K 


Using  this  measure,  minimum  error  pruning  is  carried  out  as  follows  : 

(1)  Start  with  the  full  grown  tree. 

(2)  For  each  internal  node  t, 

•  Compute  the  expected  error  :  EER(t ),  and 

•  Compute  the  error  of  the  sub-tree  given  by  :  EER(Tt )  =  J2seTt  PsEER(s), 
where  ps  =  n(s ) /nit)  is  the  proportion  of  the  number  of  samples  in  node 

s. 

•  A  node  t  is  pruned  if  EERit)  <  EER(Tt),  or  not  otherwise. 
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3  Maximum  a  posteiori  pruning  algorithm 


In  this  section,  we  propose  a  new  pruning  algorithm  which  combines  the  ideas 
of  the  CCP  and  Bayesian  approach.  First,  we  explain  how  to  calculate  the 
posterior  probability  of  a  given  tree,  and  then  explain  how  to  use  the  posterior 
probabilities  to  select  the  best  tree. 

The  structure  of  a  given  tree  T  is  composed  of  the  terminal  nodes  T  and 
parameters  0  =  (0i,  - .  ■  ,0f)  at  the  terminal  nodes.  Let  yti  denote  the  i-th 
observation  in  the  f-th  terminal  node,  t  —  1, ...,  |T|,  i  —  1, nt,  and  let 

Y  =  (Yi, ...,  Y^y,  where  Yt  =  (ytl, ...,  ytnt)'- 


For  a  classification  tree,  it  is  typically  assumed  that,  conditionally  on  (0,  T),  y 
values  within  the  terminal  node  are  independent  and  identical  distributed,  and 
y  values  across  terminal  nodes  are  independent.  In  addition,  since  ytl  belongs 
to  one  of  K  categories  (i.e.  K  class  classification  problem),  we  assume  f(yu\9t) 
to  be  a  multinomial  distribution.  Then  the  likelihood  becomes 

m  nt  \t\ 

p(Y\e,  i )  -  n  n  f(vu\e,) = n  j>!f‘  •  •  •*>&*, 

t= 1  i= 1  t= 1 

where  9t  =  (pa,  ■■■,PtK),Ptk  >  0,  J2kPtk  =  1  and  ntk  is  the  number  of  observa¬ 
tions  in  node  t  belonging  to  the  class  k. 

For  a  prior  on  T,  we  need  to  choose  two  priors  7 r(T)  and  7t(0|T),  one  for  the 
structure  of  T  and  the  other  for  0  given  T.  To  specify  7t(T),  we  mimic  the 
prior  developed  by  Chipman  et  al.  (1998).  They  considered  the  node  splitting 
probability  and  the  selection  probabilities  of  the  split  variables.  However  we 
use  only  the  node  splitting  probability.  The  reason  for  this  modification  is  as 
follows.  The  proposed  pruning  algorithm  chooses  the  optimal  tree  among  the 
sequence  of  nested  trees  already  constructed  from  the  growing  step.  Hence, 
the  random  quantity  in  the  tree  structure  is  the  size  of  the  tree,  but  not  the 
split  variables. 

For  a  given  node  t,  the  split  probability  is  defined  by 

Pr (node  split )  =  a(l  +  dt)~ ^ 

where  dt  is  the  depth  of  the  node  t  and  the  parameters  a  G  (0, 1)  and  f3  >  0 
control  the  size  and  shape  of  the  tree.  That  is,  the  larger  a  is,  the  smaller 
the  probability  of  the  bushed  tree  is.  On  the  other  hand,  for  increasing  /3,  it 
is  difficult  to  split  at  deeper  node.  Using  the  above  splitting  probability,  the 
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total  prior  mass  for  a  given  tree  T  is  specified  by 


7 r(T)=  n  ^i+^na-^+^A 

t&T-T  tef 


(2) 


For  tt(0|T),  we  assume  a  priori  7t(0|T)  =  ni=i  7t(^|T)  and  n(6t\T)  are  Dirich- 
let  distributions  with  parameters  ('jti,  ■■■-iltK)  which  are  conjugate  to  the 
multinomial  distribution. 

Once  the  likelihood  and  priors  are  specified,  we  can  calculate  the  posterior 
probability  of  a  tree  T  given  data  Y  from  the  relation 

P{T\Y)  oc  P(Y\T)n(T),  (3) 


where  P(Y\T )  =  /  P(F|0,  T)7i(Q\T)dQ,  which  turns  out  to  be 


p(y\t) = ff  .  n  rA  +,Tt) 

t=i  V  v(nt  +  Efe  ik)  feii  r (7fc) 
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(4) 


provided  -)tk  =  7fc  for  all  t. 

Now,  we  present  the  proposed  pruning  algorithm  using  the  posterior  probabil¬ 
ities  (3)  through  the  prior  (2)  and  the  marginal  likelihood  (4).  For  all  subtrees 
of  initial  fully  grown  tree,  we  could  evaluate  the  posterior  probabilities  and 
then  select  the  maximum  a  posteriori(MAP)  tree.  However,  the  number  of  all 
subtrees  evaluated  is  very  large,  which  requires  much  computational  resource. 
So,  we  need  a  reduced  set  of  subtrees  instead  of  all  possible  subtrees,  and  the 
CCP  is  an  elegant  algorithm  for  it. 

The  following  pruning  algorithm,  called  CCP-MAP,  combines  CCP  and  the 
maximum  a  posteriori  strategy.  That  is,  it  first  generates  a  sequence  of  nested 
trees  using  CCP,  and  then  calculates  the  posteriori  probabilities  of  the  nested 
trees  to  select  the  optimal  one. 


Algorithm  1.  MAP-pruning  via  cost-complexity  pruned  trees(CCP-MAP) 


1  Grow  the  tree  up  to  its  maximum  size; 

2  Get  the  sequence  of  nested  pruned  trees  {Tj, . . . ,  Tn}  using  CCP; 

3  for  %  —  1  to  n  do 

Compute  the  posterior  probability  P(Ti\data)  from  the  eq.  (3); 

end  for 

4  Selection  :  TMAP  =  arg  max{P(Ti\data),i  =  1, . . . ,  n}; 
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4  BUMPing  based  on  MAP-pruning 


An  important  application  of  the  proposed  CCP-MAP  pruning  algorithm  is  the 
BUMPing  procedure  suggested  by  Tibshirani  and  Knight  (1999).  BUMPing 
is  a  method  to  select  the  best  one  among  the  set  of  trees  constructed  on 
bootstrap  samples.  In  the  problems  where  there  are  many  local  minima  (or 
local  maxima)  such  as  decision  trees,  BUMPing  can  help  avoiding  getting 
stuck  in  a  poor  solution. 


Let  C  be  the  training  sample  and  Cb  be  a  bootstrap  sample  obtained  from  C. 

The  standard  algorithm  of  BUMPing  is  as  follows. 

(1)  Let  R\(T)  =  R(T)  +  \\T\  be  the  cost-complexity  measure  for  a  given  tree 
T  where  R  is  the  misclassihcation  error. 

(2)  Estimate  the  complexity  parameter  A  =  arg  rniri^  R\(T)  in  the  usual  way 
from  the  training  sample  C  by  cross-validation. 

(3)  For  each  bootstrap  samples  Cb,  b  =  find  the  best  tree  Tb  = 

Tx{Cb). 

(4)  Choose  the  tree  T*  =  arg  min  R{Tb)r  where  R  is  evaluated  from  the 
original  training  sample  C. 


One  problem  of  the  standard  BUMPing  algorithm  is  that  the  optimal  com¬ 
plexity  parameter  A  obtained  from  the  original  sample  C  may  not  be  optimal 
for  bootstrap  samples.  For  this  reason,  BLIMPing  may  yield  sub-optimal  re¬ 
sults.  To  improve  the  performance  of  BUMPing,  we  can  apply  CCP-MAP  to 
the  BLIMPing  algorithm  instead  of  the  CV.  This  strategy  makes  it  possible 
to  compare  larger  number  of  trees  than  usual  BUMPing,  so  that  gives  more 
chances  of  hireling  the  optimal  one.  The  proposed  BUMPing  algorithm  based 
on  CCP-MAP,  called  BUMP-MAP,  is  given  as  follows. 


Algorithm  2.  BUMPing  based  on  CCP-MAP  (BLIMP-MAP) 


1  for  b  =  1  to  B  do 

Make  a  bootstrap  sample  Cb  from  C; 

Grow  tree  T ^  from  Cb; 

Construct  a  sequence  of  pruned  trees  via  CCP:  {T^b\  k  —  1, . . . ,  rib}; 
Compute  P(T^6)|£)  from  the  training  data  C; 

end  for 

2  Choose  the  tree  having  the  maximum  a  posteriori  probability; 


5  Empirical  Results 


5.1  Experimental  setup 


In  this  section,  we  compare  the  proposed  method  with  other  pruning  algo¬ 
rithms  by  analyzing  10  real  data  sets  listed  in  Table  1,  which  are  available 
in  UCI  machine  learning  repository  (http://www.ics.uci.edu/~mlcarn  /ML- 
Repository.html).  For  evaluation  of  the  performance,  we  first  divide  the  data 
set  randomly  into  a  training  set  (70%)  and  test  set  (30%).  This  random  di¬ 
vision  of  the  data  set  is  repeated  20  times.  All  the  following  results  are  the 
summary  over  the  20  iterations. 


For  CCP-MAP  pruning,  we  should  determine  the  hyperparmeters  cc,  (3  and 
7’s  appropriately.  We  use  a  pair  (0.9, 1.0)  for  (a, /3)  that  produces  the  best 
results  empirically.  For  7’s,  we  choose  l’s  on  two-class  problems  because  the 
corresponding  Dirichlet  distribution  becomes  the  uniform  distribution  which 
is  thought  to  be  a  noninformative  prior.  However,  in  the  case  of  multi-class 
problems,  this  choice  results  in  poor  accuracies  since  the  total  contribution 
of  the  prior  to  the  posterior  measured  by  J2k  7 k  is  too  large.  Thus  we  set  the 
total  contribution  to  2  as  is  the  case  for  the  two-class  problem.  That  is,  we  set 
7/c  =  2/ K  for  Jl -class  problems. 


Table  1 


characteristics  of  the  d 

ata  sets  used 

.  for  the  experiments 

# 

Data  Sets 

no.  of  obs. 

no.  of  classes 

no.  of  input  vaiables 

(1) 

Australian 

690 

2 

14 

(2) 

Breast  Cancer 

683 

2 

9 

(3) 

German 

1,000 

2 

20 

(4) 

Glass 

214 

6 

10 

(5) 

Ionosphere 

351 

2 

34 

(6) 

Iris 

150 

3 

3 

(7) 

Diabetes 

768 

2 

8 

(8) 

Sonar 

210 

2 

60 

(9) 

Vehicle 

846 

4 

18 

(10) 

Vowel 

999 

11 

10 
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5.2  Comparison  of  four  pruning  methods 


We  compare  the  four  pruning  methods  -  CCP-CV,  CCP-MAP,  CCP-BIC  and 
MEP.  Table  2  and  3  summarize  the  average  generalization  errors  and  sizes  of 

Table  2 

The  averaged  misclassification  error  rates(%)  and  standard  errors  for  the  four  prun¬ 
ing  methods 


Pruning  methods 


Data  sets 

CCP-CV 

CCP-MAP 

CCP-BIC 

MEP 

Australian 

15.12 

(0.51) 

15.05 

(0.51) 

14.98 

(0.53) 

15.82 

(0.56) 

Breast  Cancer 

5.07 

(0.28) 

5.14 

(0.32) 

5.02 

(0.31) 

5.17 

(0.32) 

German 

27.67 

(0.64) 

28.70 

(0.66) 

29.13 

(0.61) 

30.90 

(0.51) 

Glass 

34.69 

(1.19) 

35.23 

(1.16) 

35.86 

(1.12) 

34.92 

(1.16) 

Ionosphere 

10.76 

(0.72) 

12.67 

(0.61) 

12.24 

(0.65) 

12.52 

(0.55) 

Iris 

6.78 

(0.65) 

6.56 

(0.67) 

6.44 

(0.60) 

6.89 

(0.70) 

Diabetes 

25.33 

(0.40) 

25.33 

(0.50) 

26.24 

(0.56) 

28.87 

(0.60) 

Sonar 

26.05 

(1.12) 

25.08 

(1.22) 

25.65 

(1.01) 

25.40 

(1.28) 

Vehicle 

30.33 

(0.77) 

30.31 

(0.47) 

30.04 

(0.34) 

29.72 

(0.39) 

Vowel 

35.37 

(0.77) 

32.64 

(0.75) 

42.59 

(0.92) 

33.55 

(0.66) 

Table  3 

The  averaged  size  of  trees  and  standard  errors  for  the  four  pruning  methods 


Pruning  methods 


Data  sets 

CCP-CV 

CCP-MAP 

CCP-BIC 

MEP 

Australian 

12.15 

(2.23) 

14.30 

(0.88) 

17.05 

(0.65) 

34.85 

(0.56) 

Breast  Cancer 

5.50 

(0.35) 

10.75 

(0.43) 

8.45 

(0.25) 

12.10 

(0.27) 

German 

10.10 

(1.35) 

24.00 

(1.82) 

44.65 

(1.33) 

73.70 

(0.88) 

Glass 

10.65 

(0.76) 

9.80 

(0.40) 

6.65 

(0.29) 

13.10 

(0.53) 

Ionosphere 

4.60 

(0.50) 

9.45 

(0.30) 

8.35 

(0.39) 

12.15 

(0.27) 

Iris 

3.35 

(0.17) 

3.65 

(0.15) 

3.25 

(0.10) 

4.90 

(0.16) 

Diabetes 

8.75 

(1.63) 

17.45 

(1.02) 

31.65 

(1.17) 

55.45 

(0.55) 

Sonar 

6.65 

(0.56) 

10.20 

(0.33) 

9.20 

(0.21) 

11.55 

(0.14) 

Vehicle 

27.80 

(2.43) 

21.40 

(1.12) 

31.40 

(1.18) 

57.70 

(0.49) 

Vowel 

53.50 

(2.69) 

58.85 

(0.74) 

36.10 

(0.97) 

55.65 

(0.88) 
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the  final  trees  for  20  randomly  partitioned  data  sets. 

CCP-CV  and  CCP-MAP  yield  similar  generalization  errors.  For  tree  sizes,  the 
standard  errors  of  CCP-CV  are  consistently  larger  than  those  of  CCP-MAP, 
though  the  averaged  tree  sizes  of  CCP-CV  tend  to  be  smaller.  This  implies 
that  CCP-MAP  gives  more  stable  results.  We  believe  that  the  instability  in 
CCP-CV  is  mainly  due  to  the  instability  of  the  CV. 

MEP  tends  to  yield  larger  trees.  In  particular,  for  two  data  sets  (German  and 
Diabetes),  the  final  trees  of  MEP  are  unnecessarily  large.  It  seems  that  MEP 
could  overfit  the  training  data  (i.e.  lower  accuracies).  The  results  of  BIC  are 
data  dependent.  For  German  and  Diabetes,  the  tree  sizes  are  too  large  that 
the  accuracies  are  worse  than  those  of  CCP-CV  and  CCP-MAP.  On  the  other 
hand,  for  Vowel,  the  resulting  tree  size  is  too  small  to  give  the  lowest  accuracy. 

5.3  Comparisons  of  the  two  bumping  methods 

We  also  compare  the  performance  of  the  BUMPing  procedures  based  on  the 
CV  and  MAP.  From  Table  4,  we  can  see  that  BUMP-MAP  improves  signifi¬ 
cantly  standard  BUMPing  (BUMP-CV)  in  most  cases  (8  out  of  10  data  sets). 
For  the  two  data  sets  Vehicle  and  Vowel  where  the  accuracies  of  BUMP-MAP 
are  lower  that  those  of  BUMP-CV,  the  tree  sizes  of  BUMP-MAP  are  too  small 
compared  to  BUMP-CV.  However,  note  that  the  MAP  pruning  algorithm  can 
control  the  sizes  of  trees  by  controlling  the  prior  parameters  (i.e.  a  and  f3 

Table  4 

The  averaged  tree  sizes  and  error  rates  by  the  two  BUMPing  procedures 


Data  set 

tree  sizes(s.e) 

average  errors(s.e) 

BUMP-CV 

BUMP-MAP 

BUMP-CV 

BUMP-MAP 

Australian 

7.55  (1.00) 

14.95  (0.74) 

13.72  (0.48) 

11.98  (0.51) 

Breast  Cancer 

5.35  (0.41) 

6.60  (0.21) 

6.40  (0.38) 

4.14  (0.25) 

German 

17.30  (3.02) 

28.45  (0.82) 

25.08  (0.72) 

21.37  (0.65) 

Glass 

9.85  (0.70) 

7.95  (0.38) 

34.61  (1.47) 

30.55  (0.98) 

Ionosphere 

5.25  (0.36) 

7.15  (0.22) 

12.19  (0.75) 

8.67  (0.70) 

Iris 

3.35  (0.15) 

3.00  (0.00) 

6.22  (0.70) 

5.56  (0.52) 

Diabetes 

14.35  (2.05) 

22.70  (0.83) 

23.83  (0.52) 

20.30  (0.57) 

Sonar 

5.60  (0.41) 

7.25  (0.27) 

27.34  (1.57) 

20.48  (0.85) 

Vehicle 

26.60  (1.13) 

18.75  (0.53) 

24.82  (0.62) 

25.93  (0.57) 

Vowel 

48.30  (1.70) 

27.40  (0.48) 

33.57  (0.74) 

39.46  (0.69) 
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in  eq.(2)).  By  doing  so,  we  can  improve  the  performance  of  BUMP-MAP  for 
these  two  data  sets. 


6  Discussions 


BUMP-MAP  can  be  compared  with  Bayesian  CART  proposed  by  Chipman 
et  al.  (1998).  The  common  feature  of  the  two  algorithms  is  to  search  the 
maximum  a  posteriori  tree.  The  difference  is  that  the  candidate  trees  are 
constructed  by  the  MCMC  algorithm  in  Bayesian  CART  while  BUMP-MAP 
generates  them  using  bootstrap  samples  and  applying  the  CCP  algorithm. 
One  disadvantage  of  Bayesian  CART  is  that  the  MCMC  algorithm  requires 
too  much  computation  so  that  it  cannot  be  applied  to  large  data  sets  easily. 
It  is  clear  that  the  proposed  method  needs  much  less  computation. 


Instead  of  choosing  the  best  one  from  many  trees  in  BUMP-MAP,  we  can 
combine  all  trees  into  one  final  model  similarly  to  bagging  (Breiman,  1996). 
While  all  trees  are  averaged  out  in  bagging,  we  can  use  the  weighted  average 
of  trees  with  weights  proportional  to  their  posterior  probabilities.  That  is, 

T(x)  =  J2WiTi(X)’ 

i= 1 


where  wy  =  •  This  aPProach  can  be  considered  as  an  approximation 

of  Bayesian  model  averaging.  We  will  pursue  this  idea  in  future. 
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Abstract 


LASSO  is  an  useful  method  to  achieve  the  shrinkage  and  variable  selection  simultaneously. 

The  main  idea  of  LASSO  is  to  use  the  L\  constraint  in  the  regularization  step.  Starting  from 
linear  models,  the  idea  of  LASSO  -  using  the  L\  constraint,  has  been  applied  to  various 
models  such  as  wavelets,  kernel  machines,  smoothing  splines,  multiclass  logistic  models 
etc.  These  models  extend  the  standard  LASSO  in  two  ways  -  there  is  more  than  one  L\ 
constraint  in  the  regularization  step  and  there  is  more  than  one  linear  combination  of  inputs 
in  the  model.  We  call  such  models  the  generalized  LASSO  models.  In  this  paper,  we  propose 
a  new  computational  algorithm  for  the  generalized  LASSO,  which  is  computationally  much 
simpler  than  standard  algorithms  such  as  QP  or  non-linear  program.  We  also  prove  that 
the  convergence  speed  is  fast  and  does  not  depend  on  the  dimension  of  inputs.  Simulations 
show  that  the  proposed  algorithm  is  fairly  fast  and  provides  reliable  results.  To  illustrate  its 
computing  power  with  high  dimensional  data,  we  analyze  real  data  sets  using  the  structural 
sparse  kernel  machine  and  multiclass  logistic  regression  model. 

Keywords:  Gradient  descent,  LASSO,  Multiclass  logistic  model,  Sparse  kernel  machines, 
Variable  selection 

1.  Introduction 

Tibshirani  (1996)  introduced  an  interesting  method  for  shrinkage  and  variable  selection 
in  linear  models,  called  “Least  Absolute  Shrinkage  and  Selection  Operator  (LASSO)”.  It 
achieves  better  prediction  accuracy  by  shrinkage  as  the  ridge  regression,  but  at  the  same 
time,  it  gives  a  sparse  solution,  which  means  that  some  coefficients  are  exactly  0.  Hence, 
LASSO  can  be  thought  to  achieve  the  shrinkage  and  variable  selection  simultaneously. 

The  main  idea  of  LASSO  is  to  use  the  L\  constraint  in  the  regularization  step.  That 
is,  the  estimator  is  obtained  by  minimizing  the  empirical  risk  subject  to  that  the  L\  norm 
of  the  regression  coefficients  is  bounded  by  a  given  positive  numbers.  Starting  from  linear 
models,  the  idea  of  LASSO  -  using  the  L\  constraint,  has  been  applied  to  various  models 
such  as  wavelets  (Chen  et  al.,  1999;  Bakin,  1999),  kernel  machines  (Gunn  and  Kandola, 
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2002;  Roth,  2004),  smoothing  splines  (Zhang  et  al.,  2003),  multiclass  logistic  regressions 
(Krishnapuram  et  al.,  2004)  etc.  These  models  extend  the  standard  LASSO  in  two  ways  - 
there  is  more  than  one  L\  constraint  in  the  regularization  step  (Gunn  and  Kandola,  2002; 
Zhang  et  al.,  2003)  and  there  is  more  than  one  linear  combination  of  inputs  in  the  model 
such  as  the  sparse  multiclass  logistic  model  (Krishnapuram  et  al.,  2004).  We  call  such 
models  the  generalized  LASSO  models,  whose  details  are  given  in  section  2. 

One  important  issue  in  the  generalized  LASSO  is  that  the  objective  function  is  not 
differentiable  due  to  the  L\  constraint,  and  hence  special  optimization  techniques  are  nec¬ 
essary.  Tibshirani  (1996)  used  the  quadratic  programming  (QP)  method  for  least  squares 
regressions  and  the  iteratively  reweighted  least  squares  procedure  (IRLS)  with  QP  for  logis¬ 
tic  regressions.  Osborne  et  al.  (2000)  proposed  a  faster  QP  algorithm  for  linear  regressions, 
which  was  extended  for  logistic  regressions  and  implemented  by  Lokhorst  et  al.  (1999)  as 
lasso2  package  in  R  system.  Recently,  Efron  et  al.  (2004)  developed  an  algorithm  called 
LARS  for  linear  regressions,  which  is  shown  to  be  closely  related  to  Osborne  et  al.  (2000) ’s 
algorithm.  The  aforementioned  algorithms,  however,  may  not  be  easily  applicable  to  large 
data  sets  when  the  dimension  of  input  variables  (or  features)  is  very  large.  One  such  ex¬ 
ample  is  the  structural  sparse  kernel  machine  (see  Example  3  in  Section  2),  where  the 
dimension  of  inputs  is  proportional  to  the  sample  size.  This  computational  drawback  is 
mainly  due  to  that  these  algorithms  need  repeated  matrix  inversions  whose  computing  cost 
increases  very  fast  as  the  dimension  of  inputs  increases.  Also,  these  algorithms  are  not 
directly  applicable  to  the  generalized  LASSO  models  where  there  are  more  than  one  L\ 
constraint  and/or  there  are  more  than  one  linear  combination  of  inputs.  There  are  some 
attempts  to  resolve  these  computational  issues.  Grandvalet  and  Canu  (1999)  implemented 
a  fixed  point  algorithm  using  the  equivalence  between  the  adaptive  ridge  regression  and 
LASSO,  and  Perkins  et  al.  (2003)  developed  a  stagewise  gradient  descent  algorithm  called 
grafting  in  the  more  general  set-up  of  LASSO.  These  algorithms,  however,  may  not  lead 
global  convergence. 

In  this  paper,  we  propose  a  gradient  descent  algorithm  for  the  generalized  LASSO. 
The  proposed  algorithm  does  not  require  matrix  inversions,  so  that  it  can  be  applied  to 
data  sets  with  large  dimensional  inputs.  Moreover,  the  algorithm  always  converges  to  the 
global  optimum,  and  its  convergence  speed  is  near  optimum  among  sequentially  updating 
algorithms. 

The  paper  is  organized  as  follows.  In  Section  2,  we  define  the  general  set-up  of  the 
generalized  LASSO  and  give  four  examples.  In  Section  3,  we  propose  a  gradient  descent 
algorithm  for  the  generalized  LASSO,  and  study  its  theoretical  properties  in  Section  4. 
Section  5  carries  out  to  compare  the  proposed  algorithm  with  Osborne  et  al.  (2000) ’s  QP 
algorithm  in  the  standard  LASSO  framework.  The  proposed  algorithm  is  applied  to  a 
structural  sparse  kernel  machine  and  a  multiclass  logistic  regression  in  Section  6  and  Section 
7,  respectively.  Concluding  remarks  follow  in  Section  8. 

2.  The  set-up  of  generalized  LASSO 

We  first  present  the  set-up  of  the  generalized  LASSO  and  give  four  examples.  Let  (yi,xi), 
. . .  ,  (yn,xn)  be  n  output /input  pairs  where  y*  £  T  and  x,  €  X  where  A  is  a  subspace  of 
Rp,  the  p-dimensional  Euclidean  space.  That  is,  x,  =  (xn, . . . ,  XiP).  Let  (3  =  (/3\, . . . ,  /3P)  be 
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the  corresponding  regression  coefficients.  For  a  given  loss  function  C,  the  objective  of  the 
generalized  LASSO  is  to  find  /3  which  minimizes  the  (empirical)  risk 

n 

R((3)  =  (i) 

2—1 


subject  to  1 1/3*| |i  <  Xi  for  l  =  1, . . . ,  d  where  0  =  (p[, . . . ,  /3lpi)  are  pi  dimensional  subvectors 
of  /3  for  l  =  1  such  that  /3  =  (/31, . . . ,  /3d).  Here  A;s  are  tuning  parameters  and 

ll/3'lli  =  EfcLi  \Plk\-  In  (!)>  . . .  ,/3d®xf)  and  /3z®x(  =  (p[xla, . .  .,PlPlx\Pl) 

where  x(  =  (xln , . 


. . ,  xipi)  are  pi  dimensional  subvectors  of  x,;  for  l  =  1 .....  cl  such  that 


x;  =  x; 


Example  1.  Multiple  linear  regression 

Let  (y±,  zi), . . . ,  (yn,  zn)  be  n  observations  where  ms  are  outputs  and  z,s  are  p-dimensional 
inputs.  A  multiple  linear  regression  model  is  given  by 

Vi  —  P  1  A  1  PpZip  A  (2) 

where  e^s  are  assumed  to  be  zero-mean  random  quantities.  The  LASSO  estimate  Pi, . . .  ,Pp 
for  the  linear  regression  model  (2)  is  the  minimizer  of 

n  /  p 

'y  ]  (  Vi  ~  y  ]  Pkzik 

i= 1  V  k=  1 

subject  to  ELi  I  A;  I  <  This  problem  can  be  embedded  into  the  set-up  of  the  generalized 
LASSO  with  d  =  1,  the  loss  function  C(y,  v)  =  (y  —  EEi  vk)2  for  v  =  (iq, . . . ,  vp),  x*  =  z 
(3  =  (3l  =  (Pi,  ...,Pp)  and  Ai  =  A. 

Remark.  The  model  (2)  is  not  of  standard  since  there  is  no  intercept  term.  The  standard 
multiple  linear  regression  model  is 


Vi  —  Po  +  PlZil  +  ■  ■  ■  +  PpZip  +  €{■ 


(3) 


The  LASSO  estimates  (Po  and  Pi, ... ,  Pp)  then  can  be  obtained  by  minimizing 

n  /  p 

yy  I  Vi  -  Po  -  yy  PkZH 
i= 1  V  k= 1 

subject  to  ELi  IAI  <  A.  Note  that  Po  is  unconstrained.  This  LASSO  problem  can  be 
embedded  into  the  generalized  LASSO  by  letting  d  =  2,  x.;  =  (l,Zj)  and  p1  =  (Po),02  = 
(Pi, . . . ,  Pp)  and  Ai  =  oo,  A2  =  A.  In  practice,  we  can  use  a  sufficient  large  number  for  Ai 
such  that  /Iq*1"3  <  Ai  where  /S®1"3  is  the  OLS  estimate  of  Po-  To  sum  up,  the  models  with 
(unconstrained)  intercept  terms  can  be  easily  accommodated  with  the  generalized  LASSO 
set-up  without  intercept  terms.  Hence,  for  ease  of  notation,  in  the  followings,  we  only 
consider  the  models  without  intercept  terms. 
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Example  2.  Multiple  logistic  regression 

A  multiple  logistic  regression  model  for  two  class  classification  problems  is  given  by 


Pr  (yi  =  l|z  j) 

1  -  Pr  (yi  =  ljz  i) 


f(zi ) 


(4) 


where 

/(zj)  =  Pi zn  H - b  PpZip. 

Here,  yi  £  {0, 1}.  The  LASSO  estimate  Pi, . . . ,  Pp  is  the  minimizer  of  the  negative  binomial 
log  likelihood 
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Y  \~yif(zi)  +  log  (l  +  e/(zi)) 

1=1 

subject  to  X^a—1  lAfcl  —  The  generalized  LASSO  setting  of  a  multiple  logistic  regression 
is  the  same  as  that  of  the  multiple  linear  regression  except  for  the  loss  function 


£(y,v) 


T  log 


1  +  exp 


Example  3.  Structural  sparse  kernel  machine 

The  structural  sparse  kernel  machine  with  second  order  interaction  terms  for  classification  is 
basically  the  same  as  (4)  except  that  /( z)  is  modelled  via  the  functional  ANOVA  (analysis 
of  variance)  decomposition 

p 

f(z)  =  Y  +  Y  fMzi’  Zkl  (5) 

j=l  j<k 

Then,  we  assume  that  fj  and  fjj-  are  elements  of  the  reproducing  kernel  Hilbert  space 
(RKHS)  Tij  and  Tijk  with  the  reproducing  kernels  Kj  and  Kjk,  respectively.  Motivated  by 
Kimeldorf  and  Wahba  (1972),  we  assume  that 

n 

fj(zj )  =  'Y  Cj,rKj  ( zrj )  zj  ) 
r= 1 

and 

n 

fjk{zji  zk)  —  ^  ^  Cjk,rKjk((Zrji  zrk)’>  zk)')m 
r= 1 

This  model  is  considered  by  many  authors  including  Gunn  and  Kandola  (2002)  and  Zhang 
et  al.  (2003).  Zhang  et  al.  (2003)  proposed  to  estimate  c^?r  and  Cjk,r  by  minimizing 

n 

Y^(ViJ(zi)) 

1=1 

subject  to  Ylr= 1  YjPj= 1  \cj,A  <  Ai  and  Ylr= 1  j<k  \cjk,A  <  A2.  This  problem  can  be  embed¬ 
ded  into  the  setting  of  the  generalized  LASSO  with  d  =  2,  Xj  =  (xf,x?)  and  /3  =  (ft1,  ft2) 
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where  x*  =  ( Kj(zrj ,  %•),  j  =  1  ,...,p,r=  1, . . . ,  n)  and  x2  =  (Kjk((zrj,  zrk),  (%,  zik)),j  < 
k,r  =  1  /31  =  (c^.j  =  and  /32  =  (cjkjr,j  <  k,r  =  l,...,n). 

Note  that  the  number  of  parameters  to  be  estimated  is  proportional  to  the  sample  size  re. 
Hence,  the  dimension  of  inputs  can  be  very  large  when  the  sample  size  is  large  even  though 
the  dimension  of  the  original  inputs  (i.e.  p )  is  small. 


Example  4.  Multiclass  logistic  regression  model 
A  J-class  logistic  regression  model  is  given  by 


Pr  (yi  =  j\zi) 


exp 


(iUi) 


z. 


i  1 


+  •  •  •  +  ffl’z. 


U)Jj) 


IP 


sr^  J 

Lm=l  eXP 


m)  (m) 


1 


Z, 


i  1 


+ 


,  oM  Jm) 

i-  Pp  Zip 


for  j  =  1 , ,J.  For  identification  of  the  model,  we  let  /?[,  =  0  for  A;  =  1, ...  ,p.  For  a 

(7) 

sparse  solution,  we  estimate  f3Y 's  by  maximizing  the  log-likelihood 


E 

i= 1 


^2I(yi=  j)/i(z?})  -  log  (  Yl  exp  (z 


3= 1 


subject  to  certain  L\  constraints  on  (3k^s  where  z'f1  =  (z^1 , . . . ,  z\p)  and 


v  771=1 


So) 


Sm) 


Si) 


for  j  =  1, . . . ,  J—  1  and  fj{  zf^)  =  0.  If  we  put  the  constraint  YZj=i  YlVk= 1  l/^l  <  A  (Krish- 
napuram  et  al.,  2004),  this  is  a  generalized  LASSO  model  with  d  =  1,  Xj  =  (z-1^, . . . ,  zj'^), 


(3  =  (0k  \  k  =  1, ...  ,p,j  =  1, ...,  J  —  1)  and  the  loss  function  £  given  by 


j-i 


£(y>  v)  =  -  ^  I(y  =  j)  vk  +  log  {  Yl  exp  vk  +  1 


(6) 


3= 1 


fc=i 


7=1 


vfc=i 


where  v  =  (v1, . . . ,  vJ_1)  and  =  (v{, ... ,  Vp). 

In  some  cases,  we  want  to  put  a  different  L\  constraint  on  each  (3^  = 
that  is  ELi  \fihP  —  Aj ,  j  =  1, . . . ,  J  —  1.  This  is  also  a  generalized  LASSO  model  with 
d  =  J  —  1,  x*  =  (zj, . . . ,  zf”1),  and  (3  =  (/31, . . . ,  (3J~1)  and  the  loss  function  C  given  in  (6). 


3.  A  gradient  descent  algorithm  for  generalized  LASSO 

In  this  section,  we  will  present  a  gradient  descent  algorithm  to  solve  the  optimization 
problem  in  the  generalized  LASSO.  Let  wl  =  f3l /\i  and  let  x,;  =  (x) , . . . ,  5cf)  where  x-  = 
A;x^.  Then,  we  can  change  (1)  to  the  equivalent  problem  which  finds  w  =  (w1, . . . ,  \vd)  by 
minimizing 

n 

R( w)  =  C  ( yi,  w  <g>  Xj) ,  subject  to  | |wz| |i  <  1,  l  =  1, . . . ,  d.  (7) 

i— 1 
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To  describe  the  proposed  gradient  descent  algorithm,  we  introduce  some  notations.  Let 

zi  =  (*Ln  ■  ■  ■ ,  4n)'  and  let 

Pi 

s1  =  {{w\w\, . . .  ,wlpypi)  :  vi  €  {zlk,-z[},wlk  >  0  =  1}- 

k= 1 

That  is,  5^  is  the  set  of  n  x  pi  dimensional  matrices  whose  column  vectors  are  wlkvlk  for 
k  =  1 , ,pi.  Finally,  for  given  nxpi  dimensional  matrices  £  Sl  for  l  =  1, . . . ,  d,  let  F  be  a 
nxp  dimensional  matrix  defined  by  F  =  (F1, . . . ,  Fd)  and  let  S  =  {F  :  Fl  £  Sl ,1  =  1, ...  ,d}. 
For  a  given  matrix  F,  define  the  cost  function  C  :  S  — >  R  by 

n 

C(F)  =  ^£(yi,ui) 

i= 1 

where  u,,  are  the  row  vectors  of  F.  Then,  finding  w  which  minimizes  (7)  is  equivalent  to 
find  F  in  S  such  that 

F  =  arg  min  (7(F).  (8) 

F  Go 

The  main  idea  of  the  proposed  algorithm  is  to  find  F  sequentially  as  follows.  Rewrite 
F/  =  (f{ , . . . ,  fp; )  where  f^,s  are  n-dimensional  column  vectors.  For  a  given  £  Sl ,  let 
F[a,  Vz]  =  (F1, . . . ,  Fi_1,  F;  +  a(Vl  —  Fz),  Fi+1, . . . ,  F0*).  Suppose  that  F^  is  the  current 
estimate  of  Fl.  Then  for  each  l,  the  proposed  algorithm  searches  a  direction  matrix  V*  E  Sl 
such  that  C(F[o!,  V?])  decreases  most  rapidly  for  some  a  £  [0, 1]  and  updates  F  to  F[a,  V/]. 
Note  that  F[a,  V^]  is  still  in  S.  Now,  the  Taylor  expansion  implies 

pi 

C(F[a,  V*])  «  (7(F)  +  a  £  (VC(F)i,  v[.  -  f'), 

k=  1 

where  VC(F)^.  =  dC(F)/dik.  Here,  (a,  b)  =  aA  f°r  a>  b  £  Rn.  Moreover,  it  can  be 
easily  shown  that 

min  f]<VCTF)i,vi-4)=  min  min  {(VC(F)i,  z{  -  f‘>,  (VC'(F)i,  -z'  -  f')}  .  (9) 
vlesl  “  k=x,...,pi  l  J 

Hence,  the  desired  direction  matrix  V;  is  such  that  the  A;-th  column  vector  is  either  zk  or 
— zlk  whichever  minimizes  the  right  hand  side  of  (9)  and  the  other  column  vectors  are  0. 
By  summing  up  the  above  arguments,  we  propose  the  gradient  descent  algorithm  for  the 
generalized  LASSO  in  Figure  1. 

The  proposed  algorithm  is  equivalent  to  the  feasible  direction  method  (see,  for  example, 
Bertsekas  (1999))  for  the  standard  LASSO  model.  In  the  standard  LASSO  model,  d  =  1 
and  the  cost  function  C(F)  is  a  function  of  Y^k=i  ffc-  Then,  to  find  the  optimal  F  in  (8)  is 
equivalent  to  find  the  optimal  vector  h  defined  by 

h  =  argminhg5(7(h) 
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1.  Initialize  :  wl  =  0,  Fq  =  0  for  l  =  1, . . . ,  d  and  Fo  =  (Fj, . . . ,  Fq). 

2.  For  to  =  1, . . . ,  M  do  ; 

(a)  Compute  the  gradient  VC'(Fm_i) 

(b)  Find  the  (Z,fc,  7)  which  minimizes  inner  products 

k,  7)  =  (VC'(Fm_1)jc,  7z[),  l  =  1, . . . ,  d,  k  =  1, . . .  ,pt,  7  =  ±1, 

and  let  V;  be  the  n  x  matrix  such  that  the  /c-th  column  vector  of  V;  is  -yzl  and  the 
other  column  vectors  are  0. 

(c)  If  k ,  7)  =  0,  then  return  w  =  (w*, . . . ,  wd). 

(d)  Else,  find  d  =  argminae[0,i]  C  (Fm_i[a,  V*])  . 

(e)  Update  F  and  w: 

Fm  =  Fm_![d,Vf] 

!(1  —  d)wlk  +  7a  ,1  =  l,k  =  k 
(1  —  a)wlk  ,1  =  l,k  ^  k 

w{  ,  o.w. 

3.  Return  w  =  (w;, wd). 


Figure  1:  Gradient  descent  algorithm  for  generalized  LASSO 

where  S  =  {X^=i  wk'vk >  vfc  €  {zfc>  ~zk},  Wk  >  0,  Y^k=\  wk  =  !}•  Note  that  S  is  the  convex 
hull  of  Z  =  {zi, . . . ,  zp}  U  {— zi, . . . ,  —  zp}.  That  is,  the  problem  of  finding  h  is  an  optimiza¬ 
tion  problem  over  a  convex  set.  One  of  the  standard  methods  for  the  optimization  over  a 
convex  set  is  the  feasible  direction  method.  In  the  feasible  method,  for  given  current  solution 
h m,  we  find  v  G  S  which  minimizes  (VC'(hm),v)  on  v  G  S.  Then,  we  find  a  which  mini¬ 
mizes  C(hm  +  o;(v  —  hm))  on  a  €  [0, 1].  Finally,  we  update  hm  by  hm+i  =  hm  +  &(v  —  h.m). 
Since  argminveiS(VC'(hm),  v)  =  argminvg^(VC'(hm),  v),  the  feasible  direction  method  for 
the  standard  LASSO  model  is  the  same  as  the  proposed  algorithm. 

4.  Convergence  Analysis 

We  assume  that  C  is  convex  and  satisfies  the  Lipschitz  with  the  Lipschitz  constant  L  on 

5.  That  is,  for  any  two  matrices  whose  fc-th  column  vectors  are  gk  =  (gik,  ■  ■  ■  ,9nk)  and 
h u  =  ( hik ,  •  •  ■ ,  hnk )'  respectively, 

j|VC(G)  -  VC(H)||  <  L||G  -  H || 

where  VC(G)  =  dC( G)/dG,  ||G  —  H 1 1 2  =  J2k=i  X4L1 (5ife  “  hik)2-  We  assume  that  S  is 
a  bounded  set.  Let  Mi  =  L supGiH&s supfc=lj ...,vYJi=\{9ik  ~  hik)2 ■  Let  C*  =  infFe5C(F) 
and  let  AC(F)  =  C(F)  —  C*.  Let  M2  =  supFeS  AC(F),  and  let  M  =  max{Mi,M2}.  In 
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addition,  for  notational  convenience,  we  denote  that  (G,  H)  =  YJk=i  (§&>  h /. ).  The  following 
Theorem  is  the  main  result  of  this  section. 

Theorem  1 


AC  (Fm)  < 


2  d?M 
m 


(10) 


First  of  all,  we  start  with  the  following  two  lemmas  before  we  prove  theorem  1. 
Lemma  2  For  any  F  £  S,  Vz  £  Sl  and  a  £  [0, 1], 


C  (V[a,  V'])  <  C(F)  +  a(VC( F)z,  Vz  -  Fl)  + 
where  VC( F)1  =  dC{ F)/dFl. 

Proof.  Define  <f>  :  R  — >  R  by  <h(a)  =  (7(F[a,  Vz] ) .  Let  ^(a)  =  d&(a)/da.  Then,  we  have 

$'(a)  =  (VC^Ffa,  Vz])', \l  -  Fl). 


Hence, 


$'(a) 


<*>'(0)1  = 

< 

< 


(VC(F[a,  V1})1  -  VC{F)l,Vl  -  Fl) 

||VC(F[a,  V1])1  -  VC'(F)* ||  || Vz  -  Fl 
La\\Vl  -  Fz || 2 


for  a  £  [0, 1].  The  first  inequality  is  by  Cauchy- Schwarz  inequality.  Thus, 

$'(a)  <  $'(0)  +  La||V-F'||2 

=  (VC(F)*,  Vz  —  F^)  +  La\\Vl  —  Fz||2. 


Hence, 

<f>(a)  -  <f>(0)  =  I  <f >'(s)ds 

Jo 

<  ({X7C(F)l,Xl -Fl)  +  Ls\\Vl -Fl\\2^Jds 
=  a(VC(F);,V'  -Fl)  +  ^\\\l  -FZ||V. 

Since  V(  and  F^  are  in  Sl,  L||VZ  —  F*||2  <  M\  <  M,  which  completes  the  proof.  ■ 


Lemma  3  For  any  given  F  £  S,  let  be  a  matrix  in  Sl  which  minimizes  (VCY(F)i,  V;— F;) 
and  let  l  =  argmini=1 d{(VC(F)z, \l  —  F()}.  Let 

a  =  arg  min  C  (  F[a,  V;1  )  . 

a£[0,l]  V  / 


Then, 

Ac(F[d,Vi])<AC(F)-^gl. 


Proof.  For  given  F  G  S,  define  the  Bregman  divergence  B(W,  F)  on  W  G  S  by 


B(W,  F)  =  C(W)  -  C( F)  -  (VC'(F),  W  -  F) 

Note  that  infwes  B(W,  F)  >  0  for  all  F  G  S  since  C  is  convex. 
Now,  let  W  be  a  matrix  in  S.  Since 


(11) 


(VC'(F),  W  —  F)  =  (VC(F)i,  W*  -  Fl 


1=1 


we  have 


So, 


(VC'(F),  W  -  F)  >  (VC( F)l,Vl  -  F' 


i=i 


^inf  (VC(F),  W  -  F)  >  d(VC( F),  V  -  Fh 


Combining  (11)  and  (12),  we  have 


(VC(F),  V  -  Fl)  <  - 


AC(F) 
~d  ' 


Lemma  2  implies  that 


AC(F[a,  V])  <  A (7(F)  -  ^ a 2. 

(Z  At 


(12) 


(13) 


(14) 


By  taking  the  minimum  on  the  right  hand  side  of  (14)  with  respect  to  a  on  [0, 1],  we  will 
get  the  desired  result  since  AC(F)/dM  <  1  for  d  >  1.  ■ 

Proof  of  Theorem  1.  We  will  use  the  mathematical  induction.  First,  consider  the  case 
of  m  =  1.  Then, 


A<7(Fi)  <  M  <  d2M  < 


2d?M 


m 


and  so  Theorem  1  is  true  for  m  =  1. 

Next,  suppose  AC(Fm)  <  2 d2M/m  for  m  >  2.  Lemma  3  implies 


AC(Fm+i)<AC(Fm)- 


AC(Fm)5 


2  d2M  ' 

Since  x  —  x2 / (2d2 M)  is  increasing  on  [0,  d2M]  and  2 d2M/m  <  d2M  for  rn  >  2,  we  have 

2  d2M  (2  d2M)2 


AC  (Fm+1)  < 


m  2d2  Mm2 

2d?M  2d2M  2  d2M 

- 2“  ^  — T’ 

m  mz  m  +  1 


which  completes  the  proof. 
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Remark.  In  the  standard  LASSO  framework,  that  is  d  =  1  and  the  cost  function  C(F)  is  a 
function  of  we  explained  in  Section  3  that  the  proposed  gradient  descent  algorithm 

is  the  same  as  the  feasible  directional  method,  which  finds  the  optimal  convex  combinations 
of  vectors  Z  =  {z\, . . . ,  zp}U{— zi, . . . ,  —  zp}.  The  convergence  rates  of  algorithms  of  finding 
the  optimal  convex  combination  of  elements  in  the  set  Z  have  been  studied  by  many  authors 
including  Jones  (1992),  Barron  (1993)  and  Zhang  (2003).  In  particular,  Zhang  (2003) 
considered  the  sequential  greedy  approximation  algorithm  which  updates  the  current  convex 
combination  hm  to  the  new  one  hm+i  by  (1  —  d)hm  +  dv  where 

(d,  v)  =  arg  min  C  ((1  —  a)hm  +  av) . 

v£Z,a£[0,l] 

He  proved  that  AC(hm)  =  Since  the  sequential  greedy  optimization  algorithm  is 

the  fastest  sequential  algorithm,  Theorem  1  implies  that  the  convergence  rate  of  the  gradient 
descent  algorithm  for  the  generalized  LASSO  is  almost  optimal,  at  least  for  the  standard 
LASSO  model.  Note  that  in  the  sequential  greedy  approximation  algorithm,  finding  (d,  v) 
is  not  easy  because  C  ((1  —  a)hm  +  av)  is  not  jointly  convex  in  a  and  v. 

Remark.  The  bound  in  Theorem  1  implies  that  the  convergence  rate  of  the  proposed 
algorithm  does  not  depend  on  the  dimension  of  inputs.  Hence,  we  expect  that  the  proposed 
algorithm  works  well  with  large  dimensional  data.  This  important  feature  of  the  proposed 
algorithm  is  also  confirmed  empirically  in  Section  5. 

5.  Simulation 

In  this  section,  we  compare  the  performance  of  the  gradient  descent  algorithm  with  the 
conventional  QP  algorithm  by  simulation.  We  first  consider  the  multiple  linear  regression 
model  given  as 

U  =  A)  +  Pizi  +  ■  •  •  +  PpZp  +  e. 

For  experiments,  we  choose  the  number  of  input  variables  of  p  =  10,  20,  30,  50  and  100. 
The  inputs  z%  are  generated  independently  from  the  standard  normal  distribution.  Also,  e 
follows  the  standard  normal  distribution.  We  set  /3q  =  1.0,  (5\  =  —0.5,  /%  =  1.0,  /?4  =  0.5  and 
other  remaining  coefficients  are  set  to  zero.  We  repeat  the  experiment  100  times  with  the 
fixed  sample  size  50.  When  (5q  is  to  be  estimated,  as  we  explained  in  Section  2,  we  put  the 
constraint  on  as  |/?o|  <  Ao  and  let  x°  =  1.  Then,  we  apply  the  gradient  descent  algorithm 
with  the  augmented  inputs  x*  =  (x°,Xj).  If  Ao  is  sufficiently  large  that  the  LASSO  estimate 
flo  satisfies  \$o\  <  Ao,  we  get  the  desired  estimate.  For  the  choice  of  Ao,  we  recommend 
Ao  =  t]Pq  where  r/  >  1  and  /3q  is  the  minimizer  of  X^=i  A))-  Our  experience  confirms 
that  G  [2,  3]  works  well. 

Table  1  shows  the  number  of  iterations  for  varying  number  of  inputs  (p)  and  different 
constraints  (A  =  0.4,  0.8.  1.2,  1.6  and  2.0)  from  the  gradient  descent  algorithm.  This 
confirms  the  theoretical  result  in  Theorem  1  that  the  convergence  speed  of  the  gradient 
decent  algorithm  does  not  depend  on  the  number  of  input  variables. 

In  order  to  compare  the  gradient  descent  algorithm  with  the  QP  algorithm,  we  use 
llce()  function  in  R  system  which  implemented  by  Lokhorst  et  al.  (1999)  applying  the 
algorithm  of  Osborne  et  al.  (2000).  The  resulting  deviances  (the  empirical  risk  divided  by 
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the  sample  size)  of  the  two  LASSO  algorithms  are  summarized  in  Table  2.  The  deviance 
of  the  gradient  descent  algorithm  is  slightly  larger  than  that  of  the  QP.  This  is  because 
the  gradient  algorithm  is  an  approximation  method  in  the  sense  that  it  converges  to  the 
optimum  when  the  number  of  iteration  goes  to  infinity.  Actually,  we  stop  the  algorithm 
when  the  decrease  of  the  deviance  is  small  enough  (we  use  10”3).  The  deviance  of  the 
gradient  descent  algorithm  can  be  reduced  more  by  iterating  the  algorithm  more.  Note, 
however,  that  the  number  of  nonzero  coefficients  in  Table  3  are  almost  identical. 

For  the  multiple  logistic  regression  model,  we  consider  the  following  model  : 

/(Z)  =  bg  (l  -  l[z))  =  L°  "  2Z1  +  Z2  +  °'5"3- 

All  the  other  experimental  settings  are  the  same  as  those  for  the  multiple  regression.  The 
average  number  of  iterations  of  the  gradient  descent  algorithm  is  in  Table  4,  and  the  de- 
viances  and  the  non-zero  coefficients  are  found  in  Table  5  and  6.  These  results  are  as  similar 
as  those  of  the  multiple  linear  regression  model. 

Our  simulation  results  show  that  the  gradient  descent  algorithm  for  the  generalized 
LASSO  finds  the  (almost)  optimum  with  much  less  computation  than  the  QP  method. 
Moreover,  its  convergence  rate  does  not  depend  on  the  number  of  inputs. 

6.  Analysis  of  Pima  Indians  Diabetes  dataset  using  the  structural  sparse 
kernel  machine 

One  important  advantage  of  the  the  proposed  algorithm  is  to  be  able  to  analyze  the  model 
with  multiple  constraints  such  as  the  structural  sparse  kernel  machine  in  Example  3  of 
Section  3.  In  this  section,  we  analyze  the  Pima  Indians  Diabetes  data  using  the  structural 
sparse  kernel  machine.  The  data  set,  which  is  available  in  the  UCI  machine  learning  repos¬ 
itory  (  http://www.ics.uci.edu/~mlearn  /MLRepository.html  ),  has  768  observations  with 
8  input  variables  and  a  binary-class  output  variable.  The  8  input  variables  and  one  output 
variable  are  follows. 

1.  preg  :  Number  of  times  pregnant 

2.  glu  :  Plasma  glucose  concentration  (glucose  tolerance  test) 

3.  pres  :  Diastolic  blood  pressure  (mmHg) 

4.  tri  :  Triceps  skin  fold  thickness(mm) 

5.  insu  :  2-Hour  serum  insulin  (mu  U/ml) 

6.  mass  :  Body  mass  index  (weight  in  kg/(height  in  m)2) 

7.  pedi  :  Diabetes  pedigree  function 

8.  age  :  Age  (years) 

9.  diabetes  :  Class  output  variable  (500  ”  and  268  “+”  tests  for  diabetes) 
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V 

A  =  0.4 

A  =  0.8 

A  =  1.2 

A  =  1.6 

A  =  2.0 

10 

4.03  ±  0.13 

15.23  ±  3.02 

93.21  ±  6.97 

181.46  ±  4.34 

233.84  ±  3.71 

20 

4.04  ±  0.13 

15.42  ±  3.04 

92.53  ±  6.97 

180.16  ±  4.14 

234.52  ±  3.90 

30 

4.15  ±  0.15 

15.65  ±  2.98 

91.06  ±  6.89 

179.32  ±  4.16 

234.32  ±  3.66 

50 

4.21  ±  0.15 

15.49  ±  2.89 

90.58  ±  6.85 

182.40  ±  4.16 

235.81  ±  3.41 

100 

4.26  ±  0.14 

14.94  ±  2.66 

90.16  ±  6.83 

181.52  ±  4.17 

245.34  ±  3.39 

Table  1: 

Average  number  of  iterations  for  the  gradient  descent  algorithm 
sion 

on  linear  regres- 

V 

A  = 

=  0.4 

A  = 

=  0.8 

A  =  1.2 

A  = 

=  1.6 

A  = 

=  2.0 

Gradient  method 

10 

133.19 

± 

3.033 

105.76 

± 

2.457 

85.03  ±  1.997 

69.56 

± 

1.642 

58.15 

± 

1.363 

20 

133.19 

± 

3.033 

105.76 

± 

2.457 

85.00  ±  1.997 

69.42 

± 

1.644 

57.72 

± 

1.372 

30 

133.18 

± 

3.033 

105.71 

± 

2.458 

84.91  ±  1.998 

69.23 

± 

1.649 

57.30 

± 

1.382 

50 

133.15 

± 

3.037 

105.65 

± 

2.465 

84.74  ±  2.011 

68.83 

± 

1.668 

56.56 

± 

1.403 

100 

133.14 

± 

3.040 

105.56 

± 

2.468 

84.45  ±  2.018 

68.20 

± 

1.683 

55.37 

± 

1.423 

QP  METHOD 

10 

133.19 

± 

3.033 

105.74 

± 

2.459 

84.85  ±  2.003 

69.17 

± 

1.644 

57.64 

± 

1.363 

20 

133.19 

± 

3.033 

105.73 

± 

2.459 

84.81  ±  2.003 

69.01 

± 

1.647 

57.19 

± 

1.373 

30 

133.18 

± 

3.033 

105.69 

± 

2.460 

84.72  ±  2.004 

68.82 

dfc 

1.651 

56.77 

± 

1.384 

50 

133.15 

± 

3.037 

105.63 

± 

2.468 

84.55  ±  2.017 

68.42 

± 

1.670 

56.03 

± 

1.405 

100 

133.14 

± 

3.040 

105.54 

± 

2.471 

84.25  ±  2.024 

67.78 

± 

1.685 

54.85 

± 

1.424 

Table  2 

:  Average  deviances  of  the  gradient  descent  algorithm  and  QP  algorithm  on  linear 

regression 

V 

A  = 

=  0.4 

A  : 

=  0.8 

A  =  1.2 

A  = 

=  1.6 

A  =  2.0 

10 

2.87 

± 

0.06 

3.54 

± 

Gradient  method 

0.07  4.19  ±  0.08 

4.86 

± 

0.09 

5.57  ±  0.09 

20 

2.88 

± 

0.07 

3.58 

± 

0.07 

4.38  ±  0.10 

5.35 

± 

0.13 

6.63  ±  0.15 

30 

2.91 

± 

0.07 

3.65 

± 

0.09 

4.56  ±  0.13 

5.79 

± 

0.16 

7.47  ±  0.18 

50 

2.95 

± 

0.08 

3.79 

± 

0.10 

4.81  ±  0.15 

6.35 

± 

0.21 

8.46  ±  0.24 

100 

3.01 

± 

0.08 

4.10 

± 

0.15 

5.42  ±  0.22 

7.51 

± 

0.27 

10.26  ±  0.33 

10 

2.87 

± 

0.06 

3.54 

± 

QP 

0.07 

METHOD 

4.19  ±  0.08 

4.95 

± 

0.08 

5.66  ±  0.09 

20 

2.88 

± 

0.07 

3.58 

± 

0.07 

4.39  ±  0.10 

5.43 

± 

0.13 

6.80  ±  0.16 

30 

2.91 

± 

0.07 

3.64 

± 

0.08 

4.61  ±  0.14 

5.93 

± 

0.17 

7.61  ±  0.19 

50 

2.95 

± 

0.08 

3.78 

± 

0.10 

4.85  ±  0.15 

6.49 

± 

0.21 

8.68  ±  0.24 

100 

3.01 

± 

0.08 

4.12 

± 

0.16 

5.45  ±  0.22 

7.64 

± 

0.27 

10.48  ±  0.33 

Table  3:  The  number  of  non-zero  coefficients  of  the  gradient  descent  algorithm  and  QP 
algorithm  on  linear  regression 
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V 

A  =  1 

A  =  2 

A  =  3 

II 

A  =  5 

10 

9.79  ±  1.16 

63.19  ±  2.57 

99.87  ±  2.25 

125.87  ±  2.53 

140.93  ±  2.63 

20 

10.31  ±  1.15 

65.41  ±  2.51 

105.40  ±  2.10 

135.92  ±  2.04 

164.55  ±  2.38 

30 

10.38  ±  1.07 

65.68  ±  2.45 

109.57  ±  2.19 

142.53  ±  2.30 

173.34  ±  2.74 

50 

11.27  ±  1.08 

66.99  ±  2.43 

112.61  ±  2.36 

150.18  ±  2.25 

178.56  ±  2.59 

100 

12.59  ±  1.23 

70.57  ±  2.59 

119.53  ±  2.25 

155.03  ±  2.00 

185.11  ±  2.53 

Table  4: 

Average  number  of  iterations  for  the  gradient  descent  algorithm 
sion 

on  logistic  regres- 

V 

A 

= 

1 

A 

2 

A  =  3 

A 

4 

A 

5 

Gradient  method 

10 

25.78 

± 

0.13 

21.52 

± 

0.19 

18.93  ±  0.22 

17.22 

± 

0.25 

16.06 

± 

0.28 

20 

25.71 

± 

0.13 

21.23 

± 

0.18 

18.30  ±  0.20 

16.18 

± 

0.22 

14.58 

± 

0.23 

30 

25.68 

± 

0.13 

21.05 

± 

0.17 

17.90  ±  0.19 

15.57 

± 

0.20 

13.73 

± 

0.21 

50 

25.62 

± 

0.13 

20.78 

± 

0.16 

17.37  ±  0.18 

14.75 

± 

0.18 

12.66 

± 

0.18 

100 

25.46 

± 

0.12 

20.29 

± 

0.14 

16.52  ±  0.15 

13.58 

± 

0.16 

11.24 

± 

0.15 

QP  METHOD 

10 

25.77 

± 

0.13 

21.38 

± 

0.18 

18.70  ±  0.22 

16.93 

± 

0.25 

15.71 

± 

0.28 

20 

25.70 

± 

0.13 

21.08 

± 

0.17 

18.06  ±  0.20 

15.87 

± 

0.22 

14.20 

± 

0.23 

30 

25.67 

± 

0.13 

20.90 

± 

0.17 

17.66  ±  0.19 

15.25 

± 

0.20 

13.35 

dh 

0.21 

50 

25.61 

± 

0.12 

20.63 

± 

0.16 

17.12  ±  0.18 

14.44 

± 

0.18 

12.28 

± 

0.18 

100 

25.44 

± 

0.11 

20.14 

± 

0.14 

16.27  ±  0.15 

13.27 

± 

0.16 

10.86 

± 

0.15 

Table  5: 

Average  deviances  of  the  gradient  descent  algorithm  and  QP  algorithm  on 

logistic 

regression 

P 

A  =  1 

A  =  2 

A  =  3 

A  =  4 

A  =  5 

Gradient  method 

10 

3.02  ±  0.09 

4.78  ±  0.12 

6.35  ±  0.14 

7.34  ±  0.15 

8.26  ±  0.14 

20 

3.43  ±  0.12 

5.96  ±  0.17 

8.29  ±  0.19 

10.03  ±  0.20 

11.43  ±  0.23 

30 

3.70  ±  0.13 

6.77  ±  0.20 

9.47  ±  0.22 

11.93  ±  0.22 

13.71  ±  0.24 

50 

4.20  ±  0.16 

7.92  ±  0.22 

11.09  ±  0.25 

14.06  ±  0.27 

16.17  ±  0.29 

100 

4.87  ±  0.18 

9.69  ±  0.24 

13.67  ±  0.28 

16.71  ±  0.30 

19.31  ±  0.33 

QP  METHOD 

10 

3.04  ±  0.09 

4.94  ±  0.13 

6.52  ±  0.15 

7.69  ±  0.15 

8.61  ±  0.14 

20 

3.50  ±  0.13 

6.18  ±  0.18 

8.71  ±  0.19 

10.42  ±  0.21 

11.90  ±  0.23 

30 

3.79  ±  0.14 

6.96  ±  0.21 

10.05  ±  0.23 

12.29  ±  0.22 

14.10  ±  0.23 

50 

4.31  ±  0.16 

8.11  ±  0.22 

11.54  ±  0.25 

14.65  ±  0.27 

16.66  ±  0.29 

100 

5.27  ±  0.20 

10.12  ±  0.24 

14.12  ±  0.29 

17.17  ±  0.31 

19.61  ±  0.31 

Table  6:  The  number  of  non-zero  coefficients  of  the  gradient  descent  algorithm  and  QP 
algorithm  on  logistic  regression 
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The  model  starts  with  the  main  effects  and  second  order  interaction  terms  as  in  (5). 
We  set  two  tuning  parameters  (Ai  and  A2)  to  be  equal,  that  is,  Ai  =  A2  =  A  and  choose 
A  =  6,  which  is  selected  by  the  5-fold  cross-validation.  The  Gaussian  kernel  with  the  scale 
parameter  reciprocally  proportional  to  the  dimension  of  inputs  is  used. 

Figure  2  depicts  the  change  of  the  deviance  on  the  number  of  the  iteration,  which  shows 
that  the  proposed  gradient  descent  algorithm  converges  fairly  fast  in  the  early  iterations. 
Even  after  50  iterations,  the  deviance  is  very  close  to  the  optimum.  Though  not  presented, 
we  report  that  this  feature  is  observed  in  all  the  data  sets  we  analyzed.  So,  we  can  save 
computing  time  more  by  stopping  earlier  with  sacrificing  little  accuracy. 

Figure  3  presents  the  L\  norms  of  the  selected  components  which  are  considered  as 
the  importance  measure  of  the  input  variables.  Here,  L 1  norm  of  fj(zj)  is  defined  by 
E”=i  lEJLi  clj K,j  (zij ,  Zij) |  /n,  and  L\  norm  of  the  second  order  interaction  term  is  defined 
similarly.  One  interesting  finding  is  that  the  interaction  term  tri*ins  is  selected  as  the 
third  most  important  feature.  It  is  well  known  that  finding  important  (nonlinear)  interaction 
terms  is  hard.  The  combination  of  the  kernel  idea  and  proposed  algorithm  makes  it  simpler. 


7.  Gene  Selection  using  the  sparse  muticlass  logistic  regression 


In  this  section,  we  apply  the  proposed  algorithm  for  selecting  relevant  genes  by  analyzing 
microarray  data  with  the  sparse  multiclass  logistic  regression  model.  Let  z*  =  (zn , . . . ,  Z{p) 
be  gene  expression  levels  from  the  i-th  examples,  i  =  1, . . . ,  n,  and  for  each  example,  there  is 
a  corresponding  label  y*  G  {1,2 ,...,«/}  representing  the  specific  type  of  cancer.  The  model 
we  consider  is 


Pr  {yi  =  J>j) 


exP(/j(zi)) 

1  +  'L^=ie*P{fm(zi)) 


for  j  =  1, . . . ,  J  —  1,  and 


Pr(yi  =  J|zj) 


1 

1  +  Em=\  eXP  (/m(z*)) 


where 

fj(zi)  =  6^  +  pid^zn  4 - 1-  Ppd^Zip. 

For  identification  of  the  model,  we  let  E/=i  \o[P  =  1  and  pk  >  0  for  k  =  1, . . .  ,p.  We 

estimate  O^s  as  well  as  pk s  by  maximizing  the  log-likelihood  given  in  (6)  with  the  constraint 
Ylk= 1  Pk  —  Since  the  solution  of  pk s  are  sparse,  we  can  select  relevant  genes  whose  ps  are 
not  zero. 

The  above  model  can  be  embedded  into  the  generalized  LASSO  set-up  by  letting  j3^  = 
PkUj..  with  the  constraint 

(15) 

i=i  fc= 1 

Then,  the  estimates  of  ps  as  well  as  Os  can  be  recovered  from  the  estimates  of  /3s  by 

Pk  =  E/=i  \Pk' ]\  and  °k]  =  Pk]/Pk- 

Now,  we  analyze  the  small  round  blue  cell  tumors  of  children  data  set  in  Khan  et  al. 
(2001),  which  consists  of  the  expression  levels  of  2,308  genes  and  4  class  labels  -  4  cancer 
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No.  of  iterations 


Figure  2:  Deviance  vs.  the  number  of  iterations  for  the  gradient  descent  algorithm  on  the 
structural  sparse  kernel  machine 
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Figure  3:  The  Li-norms  of  the  selected  components  using  the  gradient  descent  algorithm 
on  the  structural  sparse  kernel  machine 
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types  including  neuroblastoma  (NB),  rhabdomyosarcoma  (RMS),  non-  Hodgkin  lymphoma 
(NHL)  and  the  Ewing  family  of  tumors  (EWS).  The  data  set  is  divided  into  training  set 
with  63  examples  and  test  set  with  20  examples.  All  the  expression  levels  are  standardized 
to  be  centered  at  0  and  have  unit  variance. 

The  regularization  parameter  A  is  chosen  by  the  leave-one-out  cross  validation.  We  used 
the  two  measurement  for  selecting  the  optimal  A  -  CV\  and  CV \  defined  as 


CV,  =  £ 

i 

CV2  =  E 


1  -  I  (  yi  =  arg  max  f\  (zj 


=3)fj  *](zi)  +log  e*p(£/j  '](z t)) 


(16) 

(17) 


\—i] 

where  /■  is  the  estimate  of  fj  based  on  the  cross  validation  data  set  without  the  z-tli 
example.  That  is,  C'Vj  uses  the  0-1  loss  and  CV 2  uses  the  negative  log-likelihood  loss. 
Though  CV 1  is  a  standard  measurement  for  the  cross  validation,  we  tried  CV 2  because  it 
behaves  more  nicely  as  a  function  of  A.  Figure  4  shows  the  results  over  A  =  0.1,  0.2,  0.5  ,1, 
2,  5,  10,  20,  50,  100,  200,  and  500.  While  CV 1  has  multiple  minima  around  A  in  between 
20  and  50,  CV 2  has  the  unique  minimum  at  A  =  50.  We  conjecture  that  this  is  because  the 
variation  of  CV 1  is  larger  than  that  of  CV 2.  Also,  the  test  errors  of  the  models  selected  by 
CV 1  and  CV 2  are  the  same  -  one  out  of  20  test  set  examples.  Hence,  we  recommend  CV 2 
when  the  main  purpose  is  to  select  the  value  of  the  regularization  parameter  but  not  to 
estimate  the  generalization  error. 

Reliability  of  the  selected  genes  by  the  generalized  LASSO  is  another  important  issue. 
With  A  =  20  selected  by  CV±,  72  genes  were  selected,  and  with  A  =  50  selected  by  either 
CV 1  or  CV2,  102  genes  were  selected.  Considering  the  fact  that  all  inputs  were  standardized, 
we  can  regard  the  coefficient  p^  as  the  importance  measure  of  the  k-th.  gene.  We  compare 
Pk  with  the  gene  rank  based  on  the  marginal  F-ratio,  which  is  a  popular  filtering  technique 
(Dudoit  and  Speed,  2002;  Lee  et  ah,  2004).  Figure  5  shows  the  plots  of  ps  vs.  the  gene 
ranks  by  the  marginal  F-ratios  with  A  =  20  and  A  =  50,  respectively.  In  general,  the 
sparse  multilcass  logistic  model  yields  large  values  of  pk  to  genes  whose  ranks  based  on  the 
marginal  F-ratio  are  higher.  However,  surprisingly,  one  gene  whose  marginal  rank  is  lower 
than  500  is  selected  with  relatively  large  pk  by  the  the  sparse  multilcass  logistic  model.  The 
standard  procedure  for  constructing  a  classification  model  with  gene  expression  data  is  first 
to  select  small  to  medium  number  of  genes  using  a  filtering  technique  such  as  the  marginal 
F-ratio,  and  then  to  construct  a  predictive  model  based  on  the  pre-selected  genes.  The 
main  reason  for  using  this  two-stage  procedure  is  that  when  the  number  of  candidate  genes 
are  too  large,  the  most  of  classification  algorithms  have  computational  problems.  However, 
our  results  find  negative  evidence  for  this  standard  two-stage  procedure,  and  implies  that 
efficient  computing  algorithms  are  essential  for  constructing  a  good  predictive  model  with 
gene  expression  data  set.  The  proposed  algorithm  is,  of  course,  one  of  such  algorithms. 


8.  Conclusions 

In  this  paper,  we  proposed  the  gradient  descent  method  for  the  generalized  LASSO.  The 
proposed  algorithm  is  computationally  simpler  and  faster  than  the  conventional  QP  method. 
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Figure  4:  Results  of  the  small  round  blue  cell  tumors  of  children  data  set  by  the  sparse 
multiclass  logistic  model.  Left  Top:  The  CV\  plot  with  its  minimizers  marked 
by  circles.  Right  Top:  The  C\ 2  plot  with  its  minimizer  marked  by  circle.  Left 
Bottom:  The  number  of  nonzero  p^-  The  dotted  line  is  drawn  at  the  A  =  20 
chosen  by  CV\.  The  dashed  line  is  drawn  at  the  A  =  50  chosen  by  either  CV\  or 
CV 2.  Right  Bottom:  The  number  of  misclassified  examples  with  there  minimizers 
marked  by  circles. 
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Figure  5:  The  estimated  value  of  pk  sorted  by  the  rank  of  the  marginal  F-ratio.  Top:  the 
result  with  A  =  20  chosen  by  CV\ .  Bottom:  the  result  with  A  =  50  chosen  by 
either  CV\  or  CV2. 
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We  also  showed  theoretically  as  well  as  empirically  that  the  proposed  algorithm  converges 
fairly  fast  and  gives  reliable  results 

One  interesting  feature  of  the  proposed  algorithm  is  that  the  convergence  rate  is  inde¬ 
pendent  on  the  dimension  of  inputs.  The  convergence  rate  is  important  since  less  iteration 
gives  more  sparse  solutions.  Hence,  the  proposed  algorithm  is  well  suited  with  problems 
with  large  dimensional  inputs  such  as  the  sparse  kernel  machine  and  the  analysis  of  the  gene 
expression  data  set.  In  the  analysis  of  the  Pima  Indians  Diabetes  data  set,  the  algorithm 
converges  to  the  near  optimum  only  after  50  iterations.  This  is  surprising  since  the  number 
of  components  is  36.  Also,  the  final  model,  which  contains  only  8  components  out  of  36, 
which  is  very  sparse.  Another  advantage  of  our  algorithm  is  that  it  does  not  require  any 
additional  matrix  operations  in  the  preprocessing  step  which  is  necessary  when  the  design 
matrix  is  singular. 

In  the  gene  expression  data,  we  can  select  around  100  genes  out  of  more  than  2,000 
genes  using  the  proposed  algorithm  with  the  sparse  multiclass  logistic  regression  model.  In 
particular,  our  approach  does  not  need  any  pre-filtering  methods.  Also,  our  result  shows  a 
possibility  that  a  pre-filtering  method  can  delete  a  significant  gene.  Within  our  knowledge, 
our  analysis  is  the  first  of  its  kind  for  gene  expression  data  analysis. 

One  problem  of  the  proposed  algorithm  is  that  the  convergence  speed  is  rather  slow  at 
the  near  optimum.  Figure  2  represents  the  typical  behavior  of  the  deviance.  That  is,  the 
deviance  is  dropped  very  fast  in  the  early  stage,  and  then  it  decreases  slowly.  This  is  one 
reason  why  the  deviances  of  the  proposed  algorithm  are  slightly  larger  than  those  of  the 
QP  in  our  simulation  results.  Accelerating  the  convergence  speed  at  the  near  optimum  is 
worth  pursuing. 
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